This report explores, in depth, a dataset of 4898 White Wines.

Load the Packages

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

This dataset includes 11 input variables and 1 output variable (quality) across 4898 observations of White Wines.

I have decided to remove the unneeded variable X because it serves no real purpose other than as a row counter.
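
Dropping the counter column can be sketched as follows. This is a minimal illustration, not the code used for this report: the data frame name `WW` is taken from the output shown later, and the four rows are made-up stand-ins for the real 4898 observations.

```r
# Hypothetical stand-in for the wine data frame (made-up rows).
WW <- data.frame(
  X             = 1:4,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2),
  alcohol       = c(8.8, 9.5, 10.1, 9.9),
  quality       = c(6L, 6L, 6L, 5L)
)

# Drop the row-counter column; every other column is kept.
WW$X <- NULL

print(names(WW))
print(dim(WW))
```

Assigning `NULL` to a column removes it in place, so the remaining inputs and the quality column are untouched.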

I did a quick plot of every variable to better understand the data and to see each distribution. Most of these plots are roughly normal in shape. There appear to be numerous outliers at fairly high values of fixed acidity, volatile acidity, and sulphates. Density appears to be the only plot with a limited number of outliers.
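
One way to put a number on "numerous outliers" is to count the points a boxplot would flag. The sketch below assumes the usual 1.5 * IQR rule via `boxplot.stats()`; the helper name `count_outliers` and the toy values are hypothetical and are not the wine data.

```r
# Made-up values for illustration only, not the wine data.
toy <- data.frame(
  fixed.acidity = c(6.3, 6.8, 7.0, 7.2, 6.5, 6.9, 7.1, 14.2),
  density       = c(0.991, 0.992, 0.993, 0.994, 0.995, 0.996, 0.997, 0.998)
)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Apply the helper to every column and collect the counts.
outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Run on the real data frame, the same `sapply()` call would give one outlier count per variable, which makes the comparison between density and the acidity variables explicit.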

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The plots above also show a mostly normal distribution. Chlorides seems to be the only variable with a substantial number of outliers. pH, free sulfur dioxide, and total sulfur dioxide also have a couple of outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The residual sugar plot is skewed to the right, with most wines bunched at low sugar values and a long tail toward high values (perhaps it is showing fewer sweet White Wines in the distribution?), and the alcohol plot is pretty spread out.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

After a log10 transformation, residual sugar shows a bimodal distribution. White Wines are considered sweet if they have a residual sugar content greater than 45 g/L, and there appear to be few sweet White Wines here: the median and mean residual sugar are approximately 5.2 and 6.4, and the maximum is 65.8.
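
The effect of the log10 transformation can be sketched with base R. The values below are made up to mimic a right-skewed variable with two clusters; they are not the wine data, and the bin width of 0.25 is an arbitrary choice for the illustration.

```r
# Made-up residual sugar values for illustration only, not the wine data.
sugar <- c(1.1, 1.2, 1.4, 1.5, 1.6, 1.8, 2.0, 3.5,
           7.5, 8.0, 9.0, 10.5, 12.0, 14.0)

# On the raw scale the long right tail pulls the mean above the median.
raw_summary <- summary(sugar)
print(raw_summary)

# Bin the log10 values; plot = FALSE returns the counts without drawing.
log_hist <- hist(log10(sugar), breaks = seq(0, 1.25, by = 0.25), plot = FALSE)
print(log_hist$counts)

# Draw the histogram only in an interactive session.
if (interactive()) {
  plot(log_hist, main = "log10(residual sugar)", xlab = "log10(residual sugar)")
}
```

In this toy example the middle bin holds fewer values than the bins on either side, which is the two-peaked shape the transformed plot shows.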

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The citric acid distribution shows an interesting spike around 0.5. After transforming the data with log10 it appears to be mostly normal.

Univariate Analysis

What is the structure of your dataset?

The dataset is made up of 4898 observations of White Wines with 11 inputs (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and 1 output (quality).

What are the main features of interest in your dataset?

I think the main features of this White Wine dataset are alcohol (%) and residual sugar. They are the 2 main variables that do not appear to have a normal distribution.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

Chlorides, volatile acidity, and total sulfur dioxide seem to play a smaller part in the quality of the White Wine. Citric acid also has an interesting spike around 0.5.

Did you create any new variables from existing variables in the dataset?

I created a new variable called quality_fac, a factor version of quality, so that some of my plots can be split by quality level and to give better visualizations of the data in the Multivariate Section below.
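
A factor version of quality could be built along these lines. The name `quality_fac` comes from the text above; whether the original was an ordered factor, and the explicit levels 3 to 9, are assumptions made for this sketch, and the scores are made up.

```r
# Made-up quality scores for illustration only.
quality <- c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L)

# Ordered factor so that plots and tables keep the 3 < 4 < ... < 9 ranking.
quality_fac <- factor(quality, levels = 3:9, ordered = TRUE)

print(levels(quality_fac))
print(table(quality_fac))
```

Fixing the levels up front keeps every score from 3 to 9 in legends and tables even when a subset of the data has no wines at some level.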

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the

form of the data? If so, why did you do this?

Sulphates seemed to have a minor bimodal distribution, and residual sugar had a bimodal distribution when transformed with log10. The only change I made was the new variable mentioned above.

Bivariate Plots Section

I used GGpairs to look for any apparent correlations in the data. There appear to be correlations between a number of variables in the dataset worth exploring, including density/residual sugar and alcohol/quality, among many others I will explore.

This plot is also a great example of the correlation between all the

variables. It shows the strong relations between residual sugar/density and

alcohol/density as well as many others to explore.
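
The numbers behind a plot matrix like this can also be read from a plain correlation matrix. The sketch below uses base R `cor()` on made-up values; the toy density formula is an assumption chosen only so that density rises with sugar and falls with alcohol, and none of it is the wine data.

```r
# Made-up values for illustration only, not the wine data.
toy <- data.frame(
  residual.sugar = c(1.2, 2.0, 5.5, 8.0, 12.0, 18.0),
  alcohol        = c(12.5, 11.8, 10.9, 10.2, 9.6, 9.0)
)

# Toy density that rises with sugar and falls with alcohol (an assumption).
toy$density <- 0.99 + 0.0004 * toy$residual.sugar - 0.0003 * (toy$alcohol - 10)

# Pearson correlation for every pair of columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 3))
```

With the GGally package installed, `GGally::ggpairs(toy)` would draw the matching plot matrix for the same columns.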

## [1] 0.09942725

## [1] 0.4355747

## [1] -0.009209091

## [1] -0.09757683

I am comparing quality to numerous other variables to get a sense as to what

goes into making a high quality White Wine.

The best quality White Wines seem to have a pH of 3.0 to 3.5, an alcohol content between 10 and 13, medium to high levels of citric acid (0.25-0.5), and low residual sugar (0-18). I was pretty surprised to see high alcohol content as a pretty good indicator of high quality White Wines. It also appears that wines with lower residual sugar are more appealing.

## WW$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## WW$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## WW$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## WW$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## WW$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## WW$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## WW$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

## WW$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.588   4.600   6.392  10.700  16.200 
## -------------------------------------------------------- 
## WW$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## WW$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## WW$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## WW$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## WW$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## WW$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

Comparing the summaries together, the alcohol content shows a trend toward higher levels as the quality gets better, but the residual sugar summaries show no such clear trend.
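
Grouped summaries in this layout are what base R `by()` prints. The sketch below is a minimal illustration on a toy data frame; the nine alcohol values are borrowed from the quartiles printed above for qualities 5, 6 and 7 and do not stand in for the full data.

```r
# Toy data: three alcohol values per quality level (illustrative only).
WW <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# One summary() per quality level, printed with a header and divider lines.
alcohol_by_quality <- by(WW$alcohol, WW$quality, summary)
print(alcohol_by_quality)

# A single statistic per group makes the trend easier to compare.
medians <- tapply(WW$alcohol, WW$quality, median)
print(medians)
```

Reducing each group to its median puts the trend on one line instead of spreading it across several summary tables.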

## [1] -0.3071233

## [1] -0.2099344

## [1] 0.05367788

## [1] -0.1136628

They also seem to have a lower density, lower chlorides (0.02 - 0.06), a somewhat lower amount of sulphates, and fixed acidity between 4 and 8. It appears lower

the opposite would be true with the chlorides and possibly density.

## WW$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## WW$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## WW$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## WW$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## WW$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## WW$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## WW$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

## WW$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## WW$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## WW$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## WW$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## WW$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## WW$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## WW$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

It appears the higher quality wines do have lower density and chlorides. Apparently the less sweet and less salty wines are of better quality, as shown by the summary comparisons.

## [1] 0.0890207

## [1] 0.5298813

## [1] -0.7801376

## [1] 0.8389665

There is definitely a linear relationship between density and total sulfur dioxide, as seen in the plot above (0.53), as well as a negative linear relationship between alcohol and density (-0.78). Density and residual sugar also appear to have a very strong linear relationship (0.839). There doesn’t appear to be any relation between fixed acidity and residual sugar. I guess I wasn’t surprised by any of the findings except the one for density and total sulfur dioxide; I thought density was more tied to the sugar/alcohol content.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features in

the dataset?

The GGpairs plot was pretty interesting because it put everything together

and showed correlations between the variables. One of the biggest was the

correlation between density and residual sugars as well as density and

alcohol.

The higher quality White Wines tend to have a medium pH between 3 and 3.5, a higher alcohol content, a mid to high citric acid level, lower residual sugar, lower chlorides, lower density, and somewhat lower fixed acidity.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

It appears that the higher the alcohol content, the less dense the White Wine is.

What was the strongest relationship you found?

The strongest relationship appears to be between Density and Residual Sugar

(.839) because as sugar builds in the wines so does the density.

Multivariate Plots Section

Converting quality to a factor for ranking purposes.

I wanted to explore the density and alcohol relationship more due to the

strength it has with the quality factor of wines. It is pretty apparent

that the quality of wines goes up with the higher alcohol levels and lower

density levels which also directly affect each other.
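
The pattern of rising alcohol and falling density across quality levels can be checked numerically as well as visually. This is a minimal sketch on made-up values, not the wine data; `quality_fac` follows the variable name used in this report, and the base R scatterplot is only a stand-in for the plots shown here.

```r
# Made-up values for illustration only, not the wine data.
WW <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7, 8, 8),
  alcohol = c(9.2, 9.8, 10.2, 10.9, 11.1, 11.7, 11.9, 12.5),
  density = c(0.9972, 0.9960, 0.9958, 0.9941, 0.9937, 0.9920, 0.9918, 0.9903)
)
WW$quality_fac <- factor(WW$quality, ordered = TRUE)

# Mean alcohol and mean density for each quality level.
group_means <- aggregate(cbind(alcohol, density) ~ quality_fac,
                         data = WW, FUN = mean)
print(group_means)

# Scatterplot colored by quality level, drawn only in an interactive session.
if (interactive()) {
  plot(WW$alcohol, WW$density, col = as.integer(WW$quality_fac), pch = 19,
       xlab = "alcohol (%)", ylab = "density")
  legend("topright", legend = levels(WW$quality_fac),
         col = seq_along(levels(WW$quality_fac)), pch = 19, title = "quality")
}
```

In the toy table the alcohol means climb and the density means fall from one quality level to the next, which is the same direction the plots suggest for the real wines.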

Neither plot shows any real relationship between sulphates and total or free sulfur dioxide with respect to the quality of White Wine. I was curious because sulphates tend to contribute to sulfur dioxide levels according to the description of attributes.

This plot matrix brings all the combinations together in one easy to

view plot. Quality White Wines have an above average level of citric acid,

lower level of chlorides, higher level of alcohol, and medium to high level

of fixed acidity compared to the lower quality White Wines.

Once again this shows the quality of White Wines, but this time with density and alcohol switched around, showing the strong linear relationship as well as the quality factor.

This is a better look at the different quality White Wines compared to density and alcohol together and then separately in order to see the difference. I find it fascinating that alcohol and density are so closely related.

Density seems to be closely tied to the alcohol content as well as possibly

total sulfur dioxide. The higher the alcohol content the less dense the

White Wines appear to be.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. Were there features that strengthened each other in terms of

looking at your feature(s) of interest?

I was surprised to find that the higher quality White Wines seem to have a higher alcohol content, which in turn means a lower density. I thought that the opposite would be true due to the taste of alcohol.

I was also surprised to find that the higher quality White Wines had a medium to high level of citric acid as well as low levels of chlorides (salt).

Were there any interesting or surprising interactions between features?

I thought it was definitely interesting that as the alcohol content goes

up the density goes down.

OPTIONAL: Did you create any models with your dataset? Discuss the

strengths and limitations of your model.

I did not create a model.

Final Plots and Summary

Plot One

The density and residual sugar of the White Wines have a strong linear relationship, as shown in the above plot.

Plot Two

The higher quality White Wines have a higher alcohol content and lower

density than the lower quality White Wines. It also shows a strong linear

relationship between density and alcohol.

Plot Three

Both of these boxplots seem to back up my findings that higher citric acid, higher alcohol content, and lower chlorides make a better quality White Wine.

Reflection

This dataset contained 4898 observations of White Wine quality with 11 inputs and 1 output. After exploring the data in detail I can say for certain I know a lot more about wine than I have ever known. At first I was concentrating strictly on what variables are needed to make a high quality

to alcohol content as well as residual sugars. The more dense the wine was

the less alcohol content it contained.

After examining the sulphates and sulfur dioxide I was very surprised to learn they are not closely correlated, as it is mentioned in the description of attributes that sulphates can contribute to sulfur dioxide gas levels. It appears that density and total sulfur dioxide also have a linear relationship that could be further examined.

I had some trouble bringing together the multiple variables without including quality, but as I moved away from the only output variable (quality) it was apparent that there were some strong linear relationships between the other variables. I was also stumped as to why density, residual sugar, and

research and saw the linear relationships between the three.

I think there is opportunity for further understanding of the makeup of a good quality wine with more data on a wider number of white wines. Breaking up the data into the 7 classes of whites would also allow you to gain more understanding of how the different variables that make up the wines react and come together to form a high quality wine.